The slWaC Corpus of the Slovene Web

Authors

Tomaz Erjavec

Nikola Ljubesic

Natasa Logar

Abstract

The availability of large collections of text (language corpora) is crucial for empirically supported linguistic investigations of various languages; however, such corpora are complicated and expensive to collect. In recent years, corpora made from texts on the World Wide Web have become an attractive alternative to traditional corpora, as they can be built automatically, contain varied text types of contemporary language, and are quite large. The paper describes version 2 of slWaC, a Web corpus of Slovene containing 1.2 billion tokens. The corpus extends the first version of slWaC with new materials and updates the corpus compilation pipeline. The paper describes the process of corpus compilation with a focus on near-duplicate removal, and presents the linguistic annotation, format, and accessibility of the corpus via Web concordancers. It then investigates the content of the corpus using the method of frequency profiling, comparing its lemma and part-of-speech annotations with those of three corpora: the first version of slWaC; Gigafida, the one-billion-word reference corpus of Slovene; and KRES, the hundred-million-word balanced reference corpus of Slovene.

Similar resources

The slWaC 2.0 Corpus of the Slovene Web

Web corpora have become an attractive source of linguistic content, as they can be made automatically, contain varied text types of contemporary language, and are quite large. This paper introduces version 2 of slWaC, a web corpus of Slovene containing 1.2 billion tokens. The corpus extends the first version of slWaC with new materials and updates the corpus compilation pipeline. The paper desc...

hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene

Web corpora have become an attractive source of linguistic content, yet are for many languages still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a modified standard “Web as Corpus” pipeline having in mind the limited amount of available web data. The modifications are described in the paper, focusing on the ...

The A'lam Corpus: A Standard Named Entity Corpus for the Persian Language

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question and answering, machine translation, and document classification systems. A NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...

Hedges in English for Academic Purposes: A Corpus-based study of Iranian EFL learners

Hedges, as tools to express tentativeness and doubt, have been studied in plenty of research papers in the Iranian EFL research setting. However, their use in a learner corpus, portraying Iranian learner English, is in need of more research attention. With this end in view, this study aimed at investigating how Iranian EFL learners who have majored in English-related fields in Iran deployed hed...

P-69: Expression of Leptin Receptor mRNA in Ovine Corpus Luteum

Background: Many hormones are involved in the regulation of reproduction. Leptin hormone which is mainly secreted by adipose tissue plays an important role in energy homeostasis and reproduction. It seems that leptin is an important linkage between body metabolism and reproductive system. Moreover, it has been shown that leptin and leptin receptor express in reproductive organs of some species....

Journal:

Informatica (Slovenia)

Volume 39

Publication date: 2015
